Introduction:
In this report, we analyze a street-level crime dataset for Colchester covering 2024-2025, extracted from the UK Police data website. We will be delving into the trends, patterns and insights contained in the dataset.
The dataset spans a wide geographical scope, from low-level theft to serious crime and from urban areas to rural ones, providing a detailed overview of crimes and their effects on different areas. By identifying trends and patterns, we can gain a better understanding of the main factors that affect crime rates, such as time, location and demographics. The analysis should help in developing strategies that contribute to a safer environment.
crime_data <- read.csv("~/Desktop/Data Visualization/crime2024-25.csv")
#Finding the column names of crime data
colnames(crime_data)
## [1] "X" "category" "persistent_id" "date"
## [5] "lat" "long" "street_id" "street_name"
## [9] "context" "id" "location_type" "location_subtype"
## [13] "outcome_status"
#Finding the dimensions of crime data
dim(crime_data)
## [1] 6047 13
#Finding the no.of of rows
nrow(crime_data)
## [1] 6047
#Finding the no.of columns
ncol(crime_data)
## [1] 13
Descriptive Analysis:
We now examine the crime dataset in detail to gain some useful insights. Firstly, we have a total of 6047 records in our dataset, which is a substantial sample. The data frame has 13 columns; excluding the row-index column X, 12 variables represent important features of the crime data.
Variables: We break the column names down into simple subgroups to provide a concise description:
A. Crime Information:
- category: the type of crime committed (e.g. burglary, criminal-damage-arson, possession-of-weapons, shoplifting).
- persistent_id / id: unique identifiers for each crime incident.
- date: the month and year in which the incident took place.
B. Location:
- lat / long: the latitude and longitude of the incident.
- street_id / street_name: the identifier and name of the nearest street.
- location_type / location_subtype: the type of location at which the incident was recorded.
Additional Information:
- outcome_status: the latest recorded outcome of the case.
- context: extra contextual information (empty in this extract).
- X: a row index added on export.
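A compact structural check of these groupings (assuming `crime_data` is loaded as above):

```r
# str() shows each column's type and first few values in one view,
# which confirms which variables are numeric, character, or identifiers.
str(crime_data)
```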
PRE-PROCESSING:
Now let's preprocess the data to ensure that our dataset is ready for visualization and analysis. I started by exploring missing values, computing the proportion of missing entries for each variable and collecting the results in a summary table.
#Identifying the missing values
missing_values <- sapply(crime_data, function(x) sum(is.na(x)))
#Verifying if there are any missing values
if (any(missing_values>0)){
missing_portion <- missing_values/nrow(crime_data)
#Creating a summary table
missing_summary <- data.frame(
Variables = names(crime_data),
Missing_Values = missing_values,
Proportion_Missing = missing_portion
)
#Printing statistics
print(missing_summary)
} else {
print("No missing values or NA values found")
}
## Variables Missing_Values Proportion_Missing
## X X 0 0.000000
## category category 0 0.000000
## persistent_id persistent_id 0 0.000000
## date date 0 0.000000
## lat lat 0 0.000000
## long long 0 0.000000
## street_id street_id 0 0.000000
## street_name street_name 0 0.000000
## context context 6047 1.000000
## id id 0 0.000000
## location_type location_type 0 0.000000
## location_subtype location_subtype 0 0.000000
## outcome_status outcome_status 668 0.110468
As per the above analysis, the proportion of missing values for the ‘context’ variable is 1, which means there is no value of context anywhere in our dataset. It therefore cannot contribute to our analysis or visualization at all, so we drop this column entirely.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
#Drop the column with missing values
new_data <- select(crime_data, -context)
Having analysed and pre-processed the dataset, we can now dive deeper into its subparts. The most important piece of information is the crime category. To see how each crime type contributes to the dataset, i.e. the distribution of the crime categories, we create a table reflecting the frequency of each category.
#Calculating the frequency table
category_table <- table(crime_data$category)
category_table
##
## anti-social-behaviour bicycle-theft burglary
## 668 151 157
## criminal-damage-arson drugs other-crime
## 466 231 91
## other-theft possession-of-weapons public-order
## 399 58 451
## robbery shoplifting theft-from-the-person
## 81 643 84
## vehicle-crime violent-crime
## 253 2314
#Calculating the category using percentage
category_percentage <- round(100*prop.table(category_table),2)
category_percentage
##
## anti-social-behaviour bicycle-theft burglary
## 11.05 2.50 2.60
## criminal-damage-arson drugs other-crime
## 7.71 3.82 1.50
## other-theft possession-of-weapons public-order
## 6.60 0.96 7.46
## robbery shoplifting theft-from-the-person
## 1.34 10.63 1.39
## vehicle-crime violent-crime
## 4.18 38.27
Using the above table, we can observe the percentage of occurrences of each crime category. The most frequently committed crime is ‘violent-crime’, accounting for 38.27%, followed by ‘anti-social-behaviour’ at 11.05% and ‘shoplifting’ at 10.63%. These insights can help the authorities allocate resources more efficiently: for example, assigning more police officers to areas where violent crime is highest, or addressing anti-social behaviour through guidance programmes and community engagement activities.
All in all, examining the percentage of the type of crime helps in revealing important patterns and trends and helps the authorities in making informed decisions.
We shall now explore the data from a new angle to gain deeper insights. Our dataset has a ‘date’ column, which records the month and year of each crime. To examine date against category, I created a contingency table; the result is shown in the output below.
#Creating a two way table
date_category_table <- table(crime_data$category,crime_data$date)
date_category_table
##
## 2024-04 2024-05 2024-06 2024-07 2024-08 2024-09 2024-10
## anti-social-behaviour 70 80 63 53 58 58 56
## bicycle-theft 12 6 9 12 9 12 19
## burglary 10 13 9 18 16 8 17
## criminal-damage-arson 43 63 44 51 39 33 33
## drugs 25 12 12 17 19 25 21
## other-crime 10 12 6 7 9 6 12
## other-theft 34 41 34 33 35 32 38
## possession-of-weapons 5 8 6 5 7 5 3
## public-order 33 32 42 49 53 39 37
## robbery 6 7 9 10 7 10 8
## shoplifting 40 59 42 58 37 47 64
## theft-from-the-person 6 8 7 12 8 4 7
## vehicle-crime 14 13 15 41 52 17 27
## violent-crime 163 214 192 242 184 223 195
##
## 2024-11 2024-12 2025-01 2025-02 2025-03
## anti-social-behaviour 56 44 41 45 44
## bicycle-theft 29 15 9 6 13
## burglary 25 10 6 10 15
## criminal-damage-arson 30 27 28 39 36
## drugs 19 34 18 15 14
## other-crime 4 6 10 5 4
## other-theft 30 36 31 26 29
## possession-of-weapons 2 4 6 1 6
## public-order 36 24 24 40 42
## robbery 6 0 7 8 3
## shoplifting 74 61 50 69 42
## theft-from-the-person 8 9 5 2 8
## vehicle-crime 13 13 14 19 15
## violent-crime 177 209 159 180 176
The contingency table above details the crime incidents that occurred across months and categories. Firstly, ‘violent-crime’ is the most frequent crime throughout the entire year, with monthly counts mostly between 150 and 250. Other categories such as ‘anti-social-behaviour’, ‘shoplifting’ and ‘criminal-damage-arson’ show notable frequencies, though generally lower than violent crime. Crime types such as ‘possession-of-weapons’, ‘robbery’ and ‘theft-from-the-person’ occur far less frequently. Some categories fluctuate markedly: ‘vehicle-crime’ reached 52 in August 2024 but only 13 in December 2024, reflecting changes in criminal behaviour. ‘Robbery’ remains fairly consistent through the year, with a count of 0 only in December 2024, which might suggest a lower robbery rate around the holiday season. All in all, these observations can guide targeted action, distribution of resources, and policy making to address crime effectively and reduce risk.
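The month-to-month comparisons above can be made more direct by normalising the table; a minimal sketch (assuming `crime_data` is loaded as above):

```r
# Rebuild the two-way table, then express each month's counts as
# within-month percentages so months of different sizes compare directly.
date_category_table <- table(crime_data$category, crime_data$date)
monthly_totals <- margin.table(date_category_table, margin = 2)   # crimes per month
monthly_shares <- round(100 * prop.table(date_category_table, margin = 2), 1)
print(monthly_totals)
# e.g. category shares for December 2024
print(monthly_shares[, "2024-12"])
```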
#Loading library
library (ggplot2)
#sorting the categories in terms of frequency
sort_categories <- names(sort(category_table, decreasing = TRUE))
#Converting category to a factor
crime_data$category <- factor(crime_data$category,levels=sort_categories)
#Creating a barchart in decreasing order of frequency
ggplot(crime_data, aes(x = reorder(category, as.numeric(category)), fill=category)) + geom_bar() + labs(title = "Frequency of Crime Category", x= NULL, y= "Frequency") + theme(axis.text.x = element_text(angle=45, hjust=1),
plot.title = element_text(hjust = 0.5))
The bar plot above provides a clear view of the frequency of the different crime categories, with violent crime at the top, followed by anti-social behaviour and shoplifting. The visualization is aimed at various stakeholders, including law enforcement agencies, police departments, healthcare facilities and the public. Because the data is presented in descending order of frequency, the plot makes it easy to identify the most common crime issues in the community and supports informed decision making.
#Loading libraries
library (dplyr)
#Loading libraries
library(ggplot2)
count_outcome <- crime_data %>%
group_by(category, outcome_status) %>%
summarise(count = n ())
## `summarise()` has grouped output by 'category'. You can override using the
## `.groups` argument.
#Plotting enhanced bar plot of outcomes against crime
ggplot(count_outcome, aes(x=category, y=count, fill = outcome_status)) +
geom_bar(stat = "identity", position = "stack") +
labs (title = "Outcomes against Crime",
x= "Category",
y= "Count",
fill = "Outcome Status")+
theme_classic() +
theme(
axis.text.x = element_text(angle=45, hjust=1),
plot.title = element_text(hjust = 0.07))
The stacked bar plot above provides a detailed overview of crime categories and their outcomes, offering insights that are not immediately apparent from the raw data. Colours map the outcome status within each crime category, enabling complex information to be read at a glance.
All in all, these visualization can help the authorities making informed decisions, and they can then allocate resources effectively and can implement the strategies to address crime issues which can help in creating a safe environment.
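As a hedged variant of the stacked bar above, `position = "fill"` rescales every bar to 1, so outcome shares can be compared across categories of very different sizes (this assumes `count_outcome` from the previous chunk):

```r
library(ggplot2)
# Same data as the stacked plot, but each bar is normalised to proportions,
# making outcome composition comparable across small and large categories.
ggplot(count_outcome, aes(x = category, y = count, fill = outcome_status)) +
  geom_bar(stat = "identity", position = "fill") +
  labs(title = "Outcome Shares within Each Crime Category",
       x = "Category", y = "Proportion", fill = "Outcome Status") +
  theme_classic() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
```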
#Installing libraries
library(ggplot2)
#Installing libraries
library(zoo)
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
#Converting the date column to yearmon Format
crime_data$Month <- as.yearmon(crime_data$date)
#Creating a bar chart of monthly counts with colored bars
ggplot(crime_data, aes(x= Month, fill = factor(Month))) +
geom_bar(color= "red") +
labs(title = "Frequency of Crime Incidents by Month",
x= "Month",
y = "Frequency") +
scale_x_yearmon(format = "%b")+
theme(plot.title = element_text(hjust = 0.5))
The chart above shows the number of crimes that occurred each month, which should help the authorities make informed decisions. For example, July 2024 has the highest number of incidents, at more than 600, which may prompt increased police patrols or a programme to find the underlying causes. At the other end, the drop to roughly 400 incidents in January 2025 may reflect seasonal trends, weather patterns or new law-enforcement measures. By plotting crime counts against months, we gain a clearer idea of how to enhance public safety and of what drives criminal behaviour.
The scatter plot below shows the location of each crime, with latitude and longitude on the y- and x-axes respectively, offering a compelling visual narrative of spatial crime patterns. Each point represents the specific location where a crime was committed, with each colour representing a different category of crime.
#Creating the scatter plot
ggplot(crime_data, aes(x=long, y=lat, color=category))+
geom_point() +
labs(title = "Location of Crime",
x= "Longitude",
y= "Latitude") +
scale_color_discrete(name = "Crime Category")
By examining the plot, clusters of densely packed points emerge, suggesting areas that have a higher incidence of criminal activity. Conversely, the empty spaces indicate areas with lower or no recorded crime. These insights can empower authorities to make informed decisions with respect to new laws, resource allocation and community outreach.
By identifying the hot spots of criminal activity, police or law agencies can make informed decisions in a way they can deploy patrols to deter and prevent crime in high risk areas.
Moreover, law agencies and community leaders can use this information to tackle crime in specific neighbourhoods. The spatial analysis provided by the scatter plot therefore equips the authorities to prioritize their responses, ultimately creating safer and more secure communities.
-SINA PLOT
I’ve plotted the spatial data as a scatter of incident locations overlaid with 2D density contours. Longitude and latitude are displayed on the x- and y-axes. The plot illustrates a method for identifying areas with slightly higher crime concentrations.
library(ggplot2)
#Creating Sina plot
ggplot(crime_data, aes(x=long, y= lat)) +
geom_point(alpha = 0.5) +
geom_density_2d(color = "red")+
labs(title = "Sina Plot of location",
x= "Longitude",
y= "Latitude") +
theme_minimal()
The contour lines indicate areas with a higher density of recorded crimes. Regions enclosed by tightly packed contour lines are “hotspots” of crime incidents, whereas “cold spots” with lower crime density are surrounded by sparser contour lines. Law enforcement organizations, city planners and legislators can make better decisions with a thorough understanding of the density of crime incidents. This supports more efficient resource allocation, focused crime-prevention initiatives, and the implementation of tactics to lower crime rates in densely affected areas.
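An alternative rendering, sketched here under the same assumptions (`crime_data` loaded as above), uses filled density polygons so hotspot intensity is easier to read than contour lines alone:

```r
library(ggplot2)
# Filled 2D kernel density estimate of incident locations;
# brighter fill indicates denser concentrations of crime.
ggplot(crime_data, aes(x = long, y = lat)) +
  stat_density_2d(aes(fill = after_stat(level)), geom = "polygon", alpha = 0.6) +
  scale_fill_viridis_c(name = "Density") +
  labs(title = "Crime Density Hotspots", x = "Longitude", y = "Latitude") +
  theme_minimal()
```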
-Time Series Plot:
I have created a time series graph showing how the frequency of crime changes over time, using smoothing to make the trend easier to examine. The x-axis displays the dates from April 2024 to March 2025.
#Loading libraries
library(dplyr)
library(ggplot2)
library(zoo)
#Creating the time series graph
#Converting the date column to a proper Date format
crime_data$date <- as.yearmon(crime_data$date)
#Aggregating data by month
crime_counts <- crime_data %>%
group_by(date) %>%
summarise(crime_count = n())
ggplot(crime_counts, aes(x=date, y=crime_count)) +
geom_line() +
geom_smooth(method = "loess", se= FALSE, color = "blue") +
labs(title = "Crime Frequency Over Time with smoothing",
x = "Date",
y = "Number of Crimes") +
theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
The plot ranges from April 2024 to March 2025, with the number of crimes on the y-axis. There appears to be a pattern of more crimes occurring during the summer months (May to July 2024). Overall, the highest monthly count is found in July 2024 and the lowest in January 2025.
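To check whether the summer peak is driven by one crime type or several, a hedged sketch splitting the monthly series by the largest categories (assuming `crime_data` is prepared as above, with `date` holding month values):

```r
library(dplyr)
library(ggplot2)
# Monthly counts for the four most frequent categories only,
# drawn as one line per category for direct comparison.
top_cats <- names(sort(table(crime_data$category), decreasing = TRUE))[1:4]
crime_data %>%
  filter(category %in% top_cats) %>%
  count(date, category) %>%
  ggplot(aes(x = date, y = n, colour = category, group = category)) +
  geom_line() +
  labs(title = "Monthly Counts, Four Largest Categories",
       x = "Month", y = "Number of Crimes") +
  theme_minimal()
```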
#Calculating correlation matrix
correlation_matrix <- cor(crime_data[, c('lat','long','street_id')])
print(correlation_matrix)
## lat long street_id
## lat 1.00000000 -0.12775122 -0.02948315
## long -0.12775122 1.00000000 0.03228818
## street_id -0.02948315 0.03228818 1.00000000
#Creating correlation heatmap
library(corrplot)
## corrplot 0.95 loaded
corrplot(correlation_matrix, method = "color", type = "upper",
tl.col = "black", tl.srt = 45)
- HEATMAP DATA
Based on the heatmap, each variable shows a perfect correlation of one with itself, as expected. The correlation analysis revealed some mild connections between the variables. A small negative correlation (-0.13) was found between latitude and longitude, suggesting that latitude tends to decrease slightly as longitude increases; the association is so weak that the two coordinates are essentially independent in this dataset. Moreover, street_id has very weak connections with both latitude and longitude, with correlation coefficients close to zero. This implies no meaningful relationship between street ID and the geographic coordinates. Overall, these weak correlations highlight the general independence of street ID from the dataset’s geographic variables.
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
#Creating the scatter plot with colors mapped to crime categories
int_scat <- ggplot(crime_data, aes(x=long, y=lat, color=category, text=paste("Crime Category: ", category))) + geom_point() + labs(title = "Crime Locations", x = "Longitude", y = "Latitude") + scale_color_discrete(name = "Crime Category")
#Converting to plotly object
int_scat <- ggplotly(int_scat)
int_scat
library(plotly)
#Creating the histogram of outcome status
int_histogram <- ggplot(crime_data, aes(x=outcome_status))+
geom_bar(fill = "blue") +
labs(title = "No.of outcome status",
x = "Outcome Status",
y = "Frequency") +
theme(axis.text.x = element_text(angle = 45, hjust = 1),
plot.title = element_text(hjust = 0.5))
#Convert the ggplot to a plotly object
int_histogram <- ggplotly(int_histogram)
#Display the interactive plot
int_histogram
Mapping crime data reveals important spatial patterns and highlights areas with high crime incidence. These insights can help law enforcement agencies and governments optimize resource allocation and implement targeted interventions directly addressing criminal activity. Crime maps not only show where crimes occur but also offer potential explanations for why they are concentrated in specific locations. Additionally, visualizing crime data supports better collaboration among various stakeholders—such as police departments, local authorities, and NGOs—enabling the development of more comprehensive, coordinated strategies to prevent crime and enhance public safety.
#Load Libraries
library(leaflet)
library(dplyr)
#creating a leaflet map
#Mapping categories to valid colors (leaflet expects CSS colors, not raw factor levels)
pal <- colorFactor(palette = "viridis", domain = crime_data$category)
crime_map <- leaflet(crime_data) %>%
addTiles() %>%
addCircleMarkers(
lng = ~long,
lat = ~lat,
radius = 5,
color = ~pal(category),
fillOpacity = 0.8,
popup = ~paste("Category: ", category)
)
crime_map
Studying a set of criminal offences involves multiple steps that contribute to a comprehensive understanding of criminal behaviour and support informed decision making. The process begins with exploring the structure and features of the dataset. Pre-processing tasks, such as handling missing values, help ensure data integrity and improve the reliability of the findings.
To uncover trends, patterns, and associations within the data, various visualizations are employed, including tables, two-way tables, bar plots, histograms, density plots and scatter plots. These graphical representations play a crucial role in identifying crime hotspots, understanding temporal variations, and analyzing relationships between variables, providing valuable insights for strategic planning, tactical decision-making, and efficient resource allocation.
Correlation analysis enhances the process by revealing potential factors that contribute to rising crime rates. Interactive plots and maps also promote user engagement, enabling dynamic exploration of spatial and temporal crime patterns. All in all, these analytical approaches empower stakeholders such as law enforcement to make evidence-based decisions and develop effective crime-prevention strategies.
CLIMATE DATA
Introduction: This report presents an analysis of climate data collected from a weather station near Colchester. Our aim is to explore the region’s climate by identifying the trends, patterns and key insights within the dataset. By examining and visualizing the data related to temperature, precipitation, wind and humidity, we can gain a clear understanding of weather patterns, seasonal variation, and long-term climate change in the Colchester area.
The analysis enables stakeholders to identify climate-related risks, vulnerabilities and opportunities for adaptation. The insights derived from the data can inform well-founded decisions across various sectors, including infrastructure planning, disaster preparedness, tourism and agriculture. All in all, this contributes to enhancing the resilience of Colchester and its surrounding areas in the face of climate change.
#Loading the data
climate_data <- read.csv("~/Desktop/Data Visualization/temp2024-25.csv")
#finding the dimensions
dim(climate_data)
## [1] 365 18
#finding the no.of rows
nrow(climate_data)
## [1] 365
#finding the no.of columns
ncol(climate_data)
## [1] 18
#finding the no.of columns
colnames(climate_data)
## [1] "station_ID" "Date" "TemperatureCAvg" "TemperatureCMax"
## [5] "TemperatureCMin" "TdAvgC" "HrAvg" "WindkmhDir"
## [9] "WindkmhInt" "WindkmhGust" "PresslevHp" "Precmm"
## [13] "TotClOct" "lowClOct" "SunD1h" "VisKm"
## [17] "SnowDepcm" "PreselevHp"
Let’s take a closer look at the contents of this climate dataset to uncover some meaningful insights. We begin with a simple descriptive analysis to understand the basic characteristics of the data. The dataset contains a total of 365 records, providing a full year of observations, which is a solid foundation for analysis.
There are 18 variables in the dataset, each representing an important aspect of climate measurement. Below is a brief overview of the main variables:
- station_ID: identifier for the weather station where the data was collected.
- Date: the date on which the weather observations were recorded.
- TemperatureCAvg: average temperature recorded on the date.
- TemperatureCMax: maximum temperature recorded on the date.
- TemperatureCMin: minimum temperature recorded on the date.
- TdAvgC: average dew point temperature.
- HrAvg: average relative humidity.
- WindkmhDir: wind direction recorded on that date.
- WindkmhInt: average wind speed (in kilometres per hour).
- WindkmhGust: maximum wind gust speed (in kilometres per hour).
- PresslevHp: atmospheric pressure (in hectopascals).
- Precmm: total precipitation (in millimetres).
- TotClOct: total cloud cover observed.
- lowClOct: low cloud cover observed.
- SunD1h: sunshine duration (in hours).
- VisKm: visibility (in kilometres).
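A quick numeric summary of the key temperature variables can complement these descriptions (assuming `climate_data` is loaded as above):

```r
# Five-number summaries plus means for the three temperature columns.
summary(climate_data[, c("TemperatureCAvg", "TemperatureCMax", "TemperatureCMin")])
```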
Now let’s move on to preprocessing the data, a critical step to ensure the dataset is ready for effective visualization and analysis. I began by investigating the presence of missing values in the climate dataset. To do this, I generated a table that displays the proportion of missing values for each variable, helping to identify which features may require cleaning or imputation before further analysis.
# Identifying the missing values
missing_val <- sapply(climate_data, function(x) sum(is.na(x)))
# Determine the proportion of missing values
missing_proportion <- missing_val / nrow(climate_data)
# Creating a summary table
missing_summary <- data.frame(Variables = names(climate_data),
Missing_Values = missing_val,
Proportion_Missing = missing_proportion)
#Printing the summary of statistics
print(missing_summary)
## Variables Missing_Values Proportion_Missing
## station_ID station_ID 0 0.000000000
## Date Date 0 0.000000000
## TemperatureCAvg TemperatureCAvg 0 0.000000000
## TemperatureCMax TemperatureCMax 0 0.000000000
## TemperatureCMin TemperatureCMin 0 0.000000000
## TdAvgC TdAvgC 0 0.000000000
## HrAvg HrAvg 0 0.000000000
## WindkmhDir WindkmhDir 0 0.000000000
## WindkmhInt WindkmhInt 0 0.000000000
## WindkmhGust WindkmhGust 0 0.000000000
## PresslevHp PresslevHp 0 0.000000000
## Precmm Precmm 23 0.063013699
## TotClOct TotClOct 0 0.000000000
## lowClOct lowClOct 9 0.024657534
## SunD1h SunD1h 1 0.002739726
## VisKm VisKm 0 0.000000000
## SnowDepcm SnowDepcm 350 0.958904110
## PreselevHp PreselevHp 365 1.000000000
From the results we can see that the missing proportion for ‘PreselevHp’ is 1 and for ‘SnowDepcm’ is 0.96, meaning these columns have almost all of their values missing. They therefore cannot contribute to the analysis, so I dropped both columns entirely.
updated_data <- climate_data[, !(colnames(climate_data) %in% c("PreselevHp", "SnowDepcm"))]
colnames(updated_data)
## [1] "station_ID" "Date" "TemperatureCAvg" "TemperatureCMax"
## [5] "TemperatureCMin" "TdAvgC" "HrAvg" "WindkmhDir"
## [9] "WindkmhInt" "WindkmhGust" "PresslevHp" "Precmm"
## [13] "TotClOct" "lowClOct" "SunD1h" "VisKm"
#Two Way Table
temp_two_way_table <- climate_data %>%
group_by(station_ID, Date) %>%
summarise(
TemperatureCAvg = mean(TemperatureCAvg, na.rm = TRUE),
TemperatureCMax = mean(TemperatureCMax, na.rm = TRUE),
TemperatureCMin = mean(TemperatureCMin, na.rm = TRUE),
.groups = "drop"
) %>%
ungroup()
#View(temp_two_way_table)
To visualize the climate data, I made a simple bar plot of the distribution of wind direction. After flipping the coordinates, the x-axis shows how many times the wind blew from a specific direction, and the y-axis lists the different wind directions.
library(ggplot2)
#Creating the bar plot
ggplot(climate_data, aes(x = factor(WindkmhDir))) +
geom_bar(fill = "blue", color = "black") +
labs(title = "Distribution of Wind Direction",
x = "Wind Direction",
y = "Count") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1),
plot.title = element_text(hjust = 0.5),
axis.title = element_text(size = 12, face = "bold"),
axis.text = element_text(size = 10),
legend.position = "none") +
coord_flip()
From the bar plot above, it is quite evident that SW (southwest) had the highest frequency of occurrence among all the observed wind directions in the dataset.
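A quick tabular check of this claim (assuming `climate_data` is loaded as above):

```r
# Wind-direction frequencies, most common first.
head(sort(table(climate_data$WindkmhDir), decreasing = TRUE))
```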
To visualize the distribution of wind intensity and wind gust intensity, I combined both variables into a single column and created a unified histogram. The x-axis represents wind intensity values, while the y-axis indicates their frequency of occurrence.
library(ggplot2)
library(tidyr)
#Combining the data of WindkmhInt & WindkmhGust
combined_data <- climate_data %>%
pivot_longer(cols = c(WindkmhInt, WindkmhGust), names_to = "Variable", values_to = "Windkmh")
#Histogram of the combined data
ggplot(combined_data, aes(x= Windkmh, fill = Variable)) +
geom_histogram(binwidth = 5, position = "dodge", color = "black") +
labs(title = "Distribution of Wind Gust Intensity and Wind Intensity",
x= "Wind Intensity",
y = "Frequency",
fill = "Variable")+
scale_fill_manual(values = c("skyblue", "salmon"))+
theme_minimal()
By analysing the above histogram, we can see that gust intensities extend to considerably higher values than the average wind intensities; for instance, the bar around 50 km/h contains roughly 26 observations.
# Creating a box plot of average temperature by wind direction
ggplot(data = climate_data, aes(x = factor(WindkmhDir), y = TemperatureCAvg)) +
geom_boxplot(fill = "skyblue", color = "black") +
labs(title = "Box Plot of Average Temperature",
x = "Wind Direction",
y = "Average Temperature")
The box plot above was created to better understand how wind direction relates to average temperature. It is quite evident that westerly winds tend to span a wider range of temperatures than the easterly winds E, ENE, NE and NNE. This suggests that westerly winds are associated with fluctuating temperatures, while easterly winds are linked to a more consistent temperature range.
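The spread visible in the box plot can be quantified; a minimal sketch using dplyr (assuming `climate_data` is loaded as above):

```r
library(dplyr)
# Median, interquartile range and count of average temperature per direction;
# directions with the largest IQR show the most variable temperatures.
climate_data %>%
  group_by(WindkmhDir) %>%
  summarise(median_temp = median(TemperatureCAvg, na.rm = TRUE),
            iqr_temp = IQR(TemperatureCAvg, na.rm = TRUE),
            n = n()) %>%
  arrange(desc(iqr_temp))
```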
#Creating a scatter plot
ggplot(climate_data, aes(x=TemperatureCAvg, y=TdAvgC)) +
geom_point()+
labs(title = "Scatter Plot Temperature vs Dew Point Temperature",
x = "Average Temperature (degC)",
y = "Average Dew Point Temperature (degC)")
I made a scatter plot to visualize average temperature against average dew point temperature. The points suggest a relationship between the two variables: there is a clear positive correlation, with average temperature rising as dew point temperature increases.
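The strength of this relationship can be quantified with a correlation coefficient (a sketch, assuming `climate_data` is loaded as above):

```r
# Pearson correlation between average temperature and dew point temperature;
# values near 1 confirm the positive trend seen in the scatter plot.
cor(climate_data$TemperatureCAvg, climate_data$TdAvgC, use = "complete.obs")
```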
#Defining the variables selected
selected_variables <- c("TemperatureCAvg", "WindkmhInt", "PresslevHp")
#Creating a subset of the dataset
selected_data <- climate_data[, selected_variables]
#Calculating and displaying the correlation matrix
correlation_matrix <- cor(selected_data)
correlation_matrix
## TemperatureCAvg WindkmhInt PresslevHp
## TemperatureCAvg 1.0000000000 -0.0003929393 -0.2217310
## WindkmhInt -0.0003929393 1.0000000000 -0.3952205
## PresslevHp -0.2217309859 -0.3952204817 1.0000000
corrplot(correlation_matrix, method = "color", type = "upper",
tl.col = "black", tl.srt = 45)
I performed correlation analysis on average temperature, wind and pressure, and created a heatmap of the correlation matrix to aid visualization of the coefficients. From the heatmap, the light red colour indicates a weak negative correlation between temperature and pressure: as temperature increases, pressure tends to decrease, although the relationship is not strong. Wind and pressure have a somewhat stronger negative correlation (-0.40); higher winds are associated with low-pressure systems. Temperature and wind have little or no correlation.
#Convert Date column to Date Format
climate_data$Date <- as.Date(climate_data$Date)
#Check the structure of data
str(climate_data)
## 'data.frame': 365 obs. of 18 variables:
## $ station_ID : int 3590 3590 3590 3590 3590 3590 3590 3590 3590 3590 ...
## $ Date : Date, format: "2025-03-31" "2025-03-30" ...
## $ TemperatureCAvg: num 9.2 9.6 8.3 10 8.3 8.6 6.5 9 10.8 12 ...
## $ TemperatureCMax: num 15.8 13.6 14.3 16.4 15.2 12.9 12.7 13.9 16.1 14.9 ...
## $ TemperatureCMin: num 3 2.9 2.9 2.6 2.6 1.2 1.2 5.2 5.2 7.3 ...
## $ TdAvgC : num 2.7 3 1.8 6.3 4.4 6.8 3.1 6.9 7.8 6.8 ...
## $ HrAvg : num 66.8 64.6 66.2 78.4 77.8 88.1 81.5 86.9 82.8 70.7 ...
## $ WindkmhDir : chr "NW" "WSW" "WNW" "SW" ...
## $ WindkmhInt : num 20.2 21.5 19.1 18.7 11 10.3 12.4 16.6 10.8 16.6 ...
## $ WindkmhGust : num 53.7 50 40.8 38.9 35.2 29.7 31.5 35.2 31.5 40.8 ...
## $ PresslevHp : num 1023 1018 1014 1016 1024 ...
## $ Precmm : num 0 0 0 0 0 0 0.2 1 0.2 0.2 ...
## $ TotClOct : num 1.5 3.6 3.1 1.3 0.9 6.4 2.1 7.3 6.7 4.2 ...
## $ lowClOct : num 4.5 6.2 6.2 8 3 7.7 6.4 7.3 7 7.3 ...
## $ SunD1h : num 9.6 10.1 3.9 11.4 10.9 4.1 6.4 0.1 3.1 5.1 ...
## $ VisKm : num 29.2 47.5 49.4 19.9 40.5 11.5 6.7 4.3 16.5 18.5 ...
## $ SnowDepcm : int NA NA NA NA NA NA NA NA NA NA ...
## $ PreselevHp : logi NA NA NA NA NA NA ...
ggplot(climate_data, aes(x=Date, y=TemperatureCAvg)) +
geom_line() +
labs (title = "Time Series of Average Temperature",
x = "Date",
y = "Average Temperature") +
theme_minimal()
I created a time series plot of average temperature against time, with 'Date' on the x-axis and 'Average Temperature' on the y-axis. The graph displays the temperature over a 12-month period from April 2024 to March 2025. It is evident that the temperature rises in spring and summer and then falls during winter. The daily average temperature varied from roughly -2 to 18 degrees Celsius over the year.
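The stated range can be checked directly from the data — a quick sketch using the `TemperatureCAvg` and `Date` columns:

```r
# Minimum and maximum daily average temperature over the year
range(climate_data$TemperatureCAvg, na.rm = TRUE)

# Dates on which the extremes occurred
climate_data$Date[which.max(climate_data$TemperatureCAvg)]
climate_data$Date[which.min(climate_data$TemperatureCAvg)]
```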
climate_data$Date <- as.Date(climate_data$Date)
#Adding smoothing to the time series graph
ggplot(climate_data, aes(x = Date, y = TemperatureCAvg)) +
#Adding the line
geom_line() +
# Adding smoothing
geom_smooth(method = "loess", se = FALSE) +
labs(title = "Time Series of Average temperature with Smoothing",
x = "Date",
y = "Average Temperature") +
theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
Histogram with Smoothed Curve
ggplot(climate_data, aes(x = TemperatureCAvg)) +
geom_histogram(binwidth = 1, fill = "yellow", color = "black") +
geom_density(aes(y = after_stat(count)), fill = "red", alpha = 0.5) +
labs(title = "Average Temperature (With Smoothed Density Curve)",
x = expression("Average Temperature (" * degree * "C)"),
y = "Frequency") +
theme_minimal()
library(ggplot2)
library(tidyr)
library(plotly)
#Combining the WindkmhInt and WindkmhGust variable
combining_data <- climate_data %>%
pivot_longer(cols = c(WindkmhInt, WindkmhGust), names_to = "Variable", values_to = "Windkmh")
#Creating a histogram
int_histm <- ggplot(combining_data, aes(x=Windkmh, fill = Variable)) +
geom_histogram(binwidth = 5,position = "dodge", color = "red") +
labs(title = "Distribution of Wind intensity and Wind Gust intensity",
x = "Wind Intensity (km/h)",
y = "Frequency",
fill = "Variable")+
scale_fill_manual(values = c ("skyblue","salmon")) +
theme_minimal()
#Converting the ggplot to plotly
int_histm <- ggplotly(int_histm)
#Display the histogram
int_histm
The above histogram illustrates the distribution of two wind-related variables: WindkmhInt (regular wind intensity) and WindkmhGust (wind gust intensity), both measured in kilometres per hour. The plot shows that WindkmhInt values are concentrated in the lower range, between 10 and 25 km/h, with a peak around 15 km/h. In contrast, WindkmhGust values are more spread out, with higher frequencies in the 25-45 km/h range and extending beyond 75 km/h. This suggests that wind gusts tend to be stronger and more variable than regular wind intensities.
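Summary statistics make the contrast concrete — a small sketch comparing centre and spread of the two wind columns:

```r
# Five-number summaries of regular wind intensity and gusts
summary(climate_data$WindkmhInt)
summary(climate_data$WindkmhGust)

# Standard deviations: a larger value for gusts would confirm
# that gusts are more variable than regular wind intensity
sd(climate_data$WindkmhInt, na.rm = TRUE)
sd(climate_data$WindkmhGust, na.rm = TRUE)
```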
library(ggplot2)
library(plotly)
climate_data$Date <- as.Date(climate_data$Date)
int_time <- ggplot(climate_data, aes(x=Date, y=TemperatureCAvg)) +
geom_line() +
geom_smooth(method = "loess", se= FALSE) +
labs(title = "Time Series of Average Temperature with Smoothing",
x = "Date",
y = "Average Temperature") +
theme_minimal()
int_time <- ggplotly(int_time)
## `geom_smooth()` using formula = 'y ~ x'
int_time
library(dplyr)
library(lubridate)
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
#Reading Data
crime_data <- read.csv("~/Desktop/Data Visualization/crime2024-25.csv", stringsAsFactors = FALSE)
climate_data <- read.csv("~/Desktop/Data Visualization/temp2024-25.csv", stringsAsFactors = FALSE)
# Converting dates to monthly
crime_data$Month <- format(as.Date(paste0(crime_data$date, "-01")), "%Y-%m")
# Summarising monthly crime count
monthly_crime <- crime_data %>%
group_by(Month) %>%
summarise(CrimeCount = n())
# Format climate dates
climate_data$Date <- as.Date(climate_data$Date)
climate_data$Month <- format(climate_data$Date, "%Y-%m")
# Summarising monthly average temperature using the correct column name
monthly_climate <- climate_data %>%
group_by(Month) %>%
summarise(AvgTemp = mean(TemperatureCAvg, na.rm = TRUE))
# Joining data sets
crime_weather <- left_join(monthly_crime, monthly_climate, by = "Month")
# Printing results
print(crime_weather)
## # A tibble: 12 × 3
## Month CrimeCount AvgTemp
## <chr> <int> <dbl>
## 1 2024-04 471 9.08
## 2 2024-05 568 13.4
## 3 2024-06 490 14.3
## 4 2024-07 608 16.5
## 5 2024-08 533 18.1
## 6 2024-09 519 14.7
## 7 2024-10 537 11.7
## 8 2024-11 509 7.24
## 9 2024-12 492 6.50
## 10 2025-01 408 3.45
## 11 2025-02 465 4.46
## 12 2025-03 447 6.96
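With the monthly table assembled, the strength of the crime-temperature relationship can be quantified directly — a sketch using the `crime_weather` tibble built above:

```r
# Pearson correlation between monthly crime counts and average temperature
cor(crime_weather$CrimeCount, crime_weather$AvgTemp, use = "complete.obs")

# Scatter plot of the twelve monthly pairs with a linear trend line
ggplot(crime_weather, aes(x = AvgTemp, y = CrimeCount)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(title = "Monthly Crime Count vs Average Temperature",
       x = "Average Temperature (°C)",
       y = "Crime Count") +
  theme_minimal()
```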
Two-Way Table
#Converting the monthly date to a full date (first of the month)
crime_data$date_full <- as.Date(paste0(crime_data$date, "-01"))
#Note: the crime dates are recorded at month level (YYYY-MM), so wday() below
#returns the weekday of the 1st of each month, not of the individual incidents
table(crime_data$category, wday(crime_data$date_full, label = TRUE))
##
## Sun Mon Tue Wed Thu Fri Sat
## anti-social-behaviour 102 123 56 121 58 56 152
## bicycle-theft 27 24 19 15 9 29 28
## burglary 18 28 17 19 16 25 34
## criminal-damage-arson 60 94 33 91 39 30 119
## drugs 59 42 21 30 19 19 41
## other-crime 12 17 12 22 9 4 15
## other-theft 68 67 38 72 35 30 89
## possession-of-weapons 9 10 3 14 7 2 13
## public-order 63 82 37 56 53 36 124
## robbery 10 16 8 14 7 6 20
## shoplifting 108 98 64 109 37 74 153
## theft-from-the-person 13 18 7 13 8 8 17
## vehicle-crime 30 55 27 27 52 13 49
## violent-crime 432 405 195 373 184 177 548
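To compare weekday patterns across categories of very different sizes, row proportions are more informative than raw counts — a sketch using `prop.table()` (keeping in mind that, with month-level dates, the weekday here reflects the first of each month):

```r
# Two-way table of category by weekday, converted to row proportions
day_table <- table(crime_data$category,
                   wday(crime_data$date_full, label = TRUE))

# Each row now sums to 1, making category profiles directly comparable
round(prop.table(day_table, margin = 1), 2)
```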
Conclusion
The comprehensive study of the climate dataset provided valuable insights into its key variables, enhancing our understanding of local climate dynamics. Descriptive analysis offered a solid foundation by illustrating the dataset’s characteristics, while thorough pre-processing ensured data integrity. The use of two-way tables, bar plots, histograms, scatter plots, time-series plots and correlation analysis effectively highlighted patterns, trends, and relationships within the data. The visualizations revealed temperature changes over the year, precipitation levels, and variations in wind speed, enabling identification of seasonal fluctuations and potential correlations. Furthermore, the incorporation of interactive plots made the analysis more accessible, fostering a deeper connection between viewers and the data. Overall, this research serves as a valuable reference for informed decision-making across various sectors by uncovering weather-related patterns, ultimately allowing a proactive approach to risks and opportunities.